@tenpercent tenpercent commented Jan 16, 2026

Summary

  • Add find_in_tuple_of_sequences compile-time search helper with O(1) template depth
  • Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension
  • Replace generate_tuple lambda in TensorDescriptor::InitializeElementSize with pack expansion
  • Apply same optimizations to TensorAdaptor

Motivation

The TensorDescriptor and TensorAdaptor classes had excessive template instantiation from:

  1. Nested static_for loops with lambdas (918 applier::operator() instantiations)
  2. generate_tuple with lambdas (78+ instantiations per class)

Why It Works

Each lambda creates a unique closure type, causing separate instantiations at every call site. The find_in_tuple_of_sequences helper uses O(1) template depth via pack expansion instead of O(N) nested static_for recursion, and named functors share a single type across all uses.

Results (example_grouped_conv_fwd_xdl_fp16)

| Metric | Before | After | Improvement |
|---|---|---|---|
| Template instantiation time | 23.4s | 19.1s | 18% reduction |
| applier instantiations | 1132 | 127 | 89% reduction |
| generate_tuple lambdas | 178 | 96 | 46% reduction |

Test Plan

  • Added 11 unit tests:
    • 5 tests for sequence_find_value
    • 6 tests for find_in_tuple_of_sequences
  • Waiting for full CI

PR Stack

This PR is part of the build time optimization effort (issue #3575). All PRs now target develop independently:

| # | PR | Description | Status |
|---|---|---|---|
| 1 | #3585 | sequence_gen with __make_integer_seq | Independent |
| 2 | #3628 | generate_identity_sequences + named functors | New (replaces #3588, #3589) |
| 3 | #3590 | container_concat optimization | Independent |
| 4 | #3596 | O(1) pack expansion rewrites | Independent |
| 5 | #3600 | TensorDescriptor/TensorAdaptor lambda elimination | This PR |

Tracking issue: #3575

@tenpercent force-pushed the mpodkory/find-transform-optimization branch from 1d351ad to ec8e794 on January 21, 2026 at 23:57
@tenpercent tenpercent changed the base branch from mpodkory/recursive-to-pack-expansion to develop January 22, 2026 01:05
The GetTransformAndItsUpperDimension function used nested static_for
loops with lambdas to search for a hidden dimension in UpperDimensionIdss.
This caused 918 applier::operator() instantiations (81% of all applier
instantiations).

Replace with find_in_tuple_of_sequences helper that uses constexpr
array lookup and if-constexpr recursion, eliminating the lambda
instantiation overhead.
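A sketch of what such a two-level search can look like. All names here are illustrative stand-ins, not the library's actual API, and this version uses a C++17 fold expression in place of if-constexpr recursion; the key property it shares with the description above is that both levels expand packs instead of nesting static_for lambdas.

```cpp
#include <cstddef>

// Illustrative stand-in for an integer Sequence type.
template <std::size_t... Vs>
struct seq
{
    static constexpr std::size_t size = sizeof...(Vs);

    // Position of target in Vs..., or size when absent. The pack expands
    // into one constexpr array, so template depth is O(1).
    static constexpr std::size_t find(std::size_t target)
    {
        constexpr std::size_t vals[] = {Vs..., 0}; // sentinel avoids zero-length array
        for(std::size_t i = 0; i < size; ++i)
            if(vals[i] == target)
                return i;
        return size;
    }
};

struct find_result
{
    std::size_t seq_idx; // which sequence, or sizeof...(Seqs) when absent
    std::size_t pos;     // position within that sequence
};

// One fold expression walks the sequences left to right and short-circuits
// at the first hit; no nested static_for, no per-call-site closure types.
template <typename... Seqs>
constexpr find_result find_in_tuple_of_sequences(std::size_t target)
{
    find_result result{sizeof...(Seqs), 0};
    std::size_t i = 0;
    (void)(((Seqs::find(target) != Seqs::size)
                ? (result.seq_idx = i, result.pos = Seqs::find(target), true)
                : (++i, false)) ||
           ...);
    return result;
}
```

For example, `find_in_tuple_of_sequences<seq<0, 2>, seq<1, 3>>(3)` yields `{1, 1}`: the hidden dimension 3 sits at position 1 of the second sequence.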

Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
…tSize

The InitializeElementSize function used generate_tuple with a lambda to
compute visible dimension lengths. Each TensorDescriptor type created
a unique lambda type, causing 78 instantiations (385ms).

Replace with direct pack expansion using helper functions, eliminating
the lambda instantiation overhead entirely.
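The replacement can be sketched like this. The descriptor type and accessor names below are hypothetical; the point is the mechanism: a named helper expands an index pack directly, so no per-descriptor closure type is ever created.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical descriptor with compile-time lengths, standing in for a
// TensorDescriptor's visible-dimension lengths.
template <std::size_t... Ls>
struct descriptor
{
    static constexpr std::size_t ndim = sizeof...(Ls);

    template <std::size_t I>
    static constexpr std::size_t get_length()
    {
        constexpr std::size_t lens[] = {Ls...};
        return lens[I];
    }
};

// Before: generate_tuple([](auto i){ ... }, ...) instantiates a unique
// lambda type per descriptor. After: one named helper template, shared
// across all descriptors, expands the index pack in a single step.
template <typename Desc, std::size_t... Is>
constexpr auto get_lengths_impl(std::index_sequence<Is...>)
{
    return std::make_tuple(Desc::template get_length<Is>()...);
}

template <typename Desc>
constexpr auto get_lengths()
{
    return get_lengths_impl<Desc>(std::make_index_sequence<Desc::ndim>{});
}
```

Calling `get_lengths<descriptor<4, 8, 16>>()` produces the tuple `(4, 8, 16)` without instantiating any closure types.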

Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
TensorAdaptor has the same InitializeElementSize and
GetTransformAndItsUpperDimension patterns as TensorDescriptor.
Apply the same optimizations:
- Replace nested static_for lambdas with find_in_tuple_of_sequences
- Replace generate_tuple lambda with pack expansion

Results: generate_tuple lambdas 100 -> 96 (4 events, 17ms eliminated)
@tenpercent force-pushed the mpodkory/find-transform-optimization branch from ec8e794 to 83a76d7 on January 22, 2026 at 01:13
Detailed comments explain:
- sequence_find_value: Constexpr loop with O(1) template depth vs O(N) recursive
- find_in_tuple_of_sequences: Pack expansion instead of nested static_for loops
- Why constexpr search reduces template instantiations dramatically
- When to apply constexpr search patterns for compile-time operations
- Implementation details for each optimization approach

This documentation helps maintainers understand the compile-time search optimization
strategy without relying on specific benchmark numbers that may vary by use case.
@tenpercent tenpercent marked this pull request as draft January 22, 2026 18:49
@cgmillette cgmillette self-assigned this Jan 23, 2026